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Abstract 

We solve the inverse problem of deblurring a pixelized image of Jupiter 
using regularized deconvolution and by sample-based Bayesian inference. 
By efficiently sampling the marginal posterior distribution for hyperpa¬ 
rameters, then the full conditional for the deblurred image, we find that we 
can evaluate the posterior mean faster than regularized inversion, when 
selection of the regularizing parameter is considered. To our knowledge, 
this is the first demonstration of sampling and inference that takes less 
compute time than regularized inversion in an inverse problems. Compar¬ 
ison to random-walk Metropolis-Hastings and block Gibbs MCMC shows 
that marginal then conditional sampling also outperforms these more com¬ 
mon sampling algorithms, having better scaling with problem size. When 
problem-specific computations are feasible the asymptotic cost of an inde¬ 
pendent sample is one linear solve, implying that sample-based Bayesian 
inference may be performed directly over function spaces, when that limit 
exists. 


1 Introduction 

We consider solving a problem in image deblurring using the two frameworks of 
regularized inversion and sample-based Bayesian inference. The computational 
cost of a standard efficient implementation of regularized inversion is taken as 
a benchmark against which we compare the cost of drawing samples from the 
associated Bayesian posterior distribution using algorithms designed for hier¬ 
archical Bayesian models, including the marginal then conditional sampler. A 
detailed comparison of algorithms is presented in an example where simplifying 
assumptions allow all matrices to be diagonalized using the Fourier transform. 
We also compute the posterior mean image in a second example with improved 
prior modeling and without simplifying assumptions to demonstrate that the 
marginal then conditional sampler is feasible, and outperforms regularized in¬ 
version, in the general setting. 

We choose image deblurring because it is a canonical linear inverse problem. 
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when using a low-level representation for the unknown iniagc0. In the idealized 
setting where data and ‘true’ image are functions, and under the usual assump¬ 
tion that the blurring process is linear and shift-invariant, the forward map 
corresponds to convolution of the true image with a fixed point-spread function, 
hence is a Fredholm integral of the first kind. When the point-spread function is 
square integrable the forward map is Hilbert-Schmidt, hence compact, implying 
that the inverse problem is ill-posed [111132]. The discrete problem also displays 
these properties, having an ill-conditioned forward map, and is referred to as a 
discrete ill-posed problem m- 

The use of a low-level representation, in this case a gray-scale pixel image, 
allows image space to be identified with K" and given a normed linear structure 
so the inverse problem is amenable to regularized inversion. Low-level repre¬ 
sentations also occur in exploratory Bayesian analyses, though more developed 
applications typically gain significantly through problem specific intermediate- 
or high-level models (see e.g. [351 El IH])- Low-level image representations 
effectively restrict the prior information that can be imposed to correlations 
between pixels values within neighborhoods in the image. We follow Vogel m 
by using the 2-norm of the graph Laplacian as the regularizing functional, to 
impose smoothness, and follow Bardsley [1] by also using this semi-norm as the 
negative log prior distribution (up to an additive constant). In our experience, 
regularized inversion and the posterior mean over a low-level image model pro¬ 
duce solutions of similar quality. Accordingly, we are primarily interested in 
computational cost, and only to a lesser degree with quality of reconstructions 
or measures of uncertainty. 

Regularization has the traditional advantage of being implemented using ma¬ 
ture and computationally efficient steps, particularly for deblurring by Fourier 
deconvolution as presented in our first computed example. On the other hand, 
sample-based Bayesian inference has the advantage of producing unbiased es¬ 
timates of the unknown ‘true’ image, or properties of that image, even when 
evaluated using a single posterior sample, while multiple samples allow eval¬ 
uation of valid uncertainties. Further, Bayesian inference naturally includes 
estimation of, or averaging over, the effective regularization parameter whereas 
regularization methods require an extra procedure for determining the regular¬ 
izing parameter [MEl]. 

The Bayesian formulation is naturally stated as a hierarchical stochastic 
model that relates measured data, unknown image, and hyperparameters. Typ¬ 
ical sample-based methods for exploring the Bayesian posterior distribution uti¬ 
lize random-walk MCMC over the posterior distribution, which can be very slow, 
especially when compared to efficient computation of a regularized solution. 

However, as we show here, the stochastic model may be factorized by marginal¬ 
izing over the unknown image, and we need only run a general MCMC for the 
low-dimensional marginal distribution over hyperparameters; our main contri¬ 
bution is to observe that the ratio of high-dimensional determinants required 

classification of Bayesian image representations and prior models as low-level, 
intermediate-level, and high-level is given in m- 
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for that low-dimensional MCMC can be computed cheaply, which we demon¬ 
strate theoretically and in the computed examples. Drawing an independent 
sample from the posterior distribution is then dominated by the same linear 
solve required in regularized deconvolution. Since selection of the regulariz¬ 
ing parameter requires many such linear solves, the computational cost of the 
regularized solution is equivalent to drawing many independent posterior sam¬ 
ples, from which robust estimates may be evaluated along with quantified un¬ 
certainties. Thus, in a setting where regularized inversion is often applied as 
an efficient solution method, we demonstrate that Bayesian inference over an 
equivalent model is actually cheaper. The marginal then conditional sampler is 
also cheaper than the block Gibbs sampler that has recently been presented as 
an efficient sampler for the linear-Gaussian inverse problem. Further, because 
the MCMC we implement over hyperparameters can be made independent of 
image size, that MCMC can be performed directly on the infinite dimensional 
image model, when that limit is well defined, with image-size dependence only 
occurring in the setup phase and in the final image-forming step. 

The paper is structured as follows: In Section ll.il we present a problem in 
semi-blind deconvolution, and in Sections [2] and |3] we present formulations for its 
solution via regularized inversion and Bayesian inference, respectively. Subsec¬ 
tion 13.31 introduces the marginal then conditional sampler that allows efficient 
computation. Section 2] compares sampling algorithms for linear-Gaussian in¬ 
verse problems, including the block Gibbs sampler, the one-block algorithm, 
and the marginal then conditional sampler. Numerical comparison of all algo¬ 
rithms is presented in Section [5] to validate theoretical results, in the convenient 
setting where efficient computation is available via the FFT. Numerical imple¬ 
mentation of fast sampling for a more general image model, including nuisance 
pixels and not assuming periodic boundary conditions, is presented in SectionlHl 
Section [7] presents a discussion of results and implications for the infinite dimen¬ 
sional limit. Technical calculations that we use for efficient marginal sampling 
in Section [S] are presented in the Appendix. 

1.1 An example of semi-blind deconvolution 

Figure [T] contains a photograph of Jupiter taken in the methane band (780nm) 
on a grid of size 256 x 256 pixels, each takes an integer value from 0 to 255. As 
can be seen, the image is somewhat blurry; the inverse problem is to recover 
the ‘true’ unblurred version of the image. 

We model the blurring process by convolution with a fixed point-spread 
function that we denote k. Denoting the true unknown image by x and the 
data by j/, the observation process may be written 

y = k * X + rj 

= Ax + rj (1) 

where the forward map A is the linear operator representing convolution and 
rj is an unknown ‘noise’ vector representing measurement errors including dig¬ 
itization. In the convolution form we think of x and y as images in the plane, 
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Figure 1: A blurry photograph of Jupiter taken in the methane band (780nm). 


while in the operator form it is more convenient to write x and y as vectors, with 
observations y € M"* and the unknown x € K.". Through most of this paper we 
compute with images x and y that are p x p pixels in size, i.e., n = m = p^. 
We are interested in the practical case where one wishes to perform inference 
from an existing data set so m is necessarily finittH. However, the size of the 
reconstructed image n is always a modeling choice; we are also interested in how 
that choice affects computational cost, particularly in the infinite dimensional 
limit n —>■ oo. 

Some authors take the infinite dimensional limit to mean that both n —>■ oo 
and m —>■ oo, that is the size of data also tends to infinity, as is common in studies 
of idealized inverse problems (see, e.g. the classic work [^). One needs to be 
aware of the potential confusion, and that results that rest on the assumption 
m ^ oo, such as those presented in [5], may not hold for the case considered 
here. 

Since the point-spread function k is unknown, and the forward map is con¬ 
volution, this deblurring problem is often called blind deconvolution. However, 
for the sake of this example we will assume a form for k as follows. The upper 
right-hand portion of the photograph shows one of the Galilean satellites that is 
small enough to be considered close to a point source, and so we can use that re¬ 
gion of the photograph (32 x 32 pixels), normalized, as an approximation to the 
point-spread function. Hence, we actually implement semi-blind deconvolution. 
It is possible to model and infer the point-spread function within regularized 
inversion or the Bayesian calculation, to implement true blind deconvolutior@, 
though we do not consider it further here. 

^In industrial applications, time and money typically Increase with m, so smaller is better. 

®We set modeling of the point-spread function and blind deconvolution as a challenge 
question for students when we use this as a classroom example. 




The forward map A is often computed efficiently using the fast Fourier trans¬ 
form (FFT). We follow [4] by assuming periodic boundary conditions for x in 
the first computed example (but not in the second). The assumption of periodic 
boundary conditions means that convolution is equivalent to circular convolu¬ 
tion, i.e. 

Ax = k®x, 

so A is diagonalized by the discrete Fourier transform and the action of A can 
be computed in the transform domain for a cost of n multiplications, with a 
further O(nlogn) operations for the inverse FFT to the image domain. 

The point spread function is non-zero for only 32 x 32 pixels and there 
appears to be a black band (of width > 16 pixels) around the edge of the blurry 
photograph. Thus, it is likely that the true image x is black in this 16 pixel 
band and hence no significant artifacts will be generated by assuming periodic 
boundary conditions. For images that are not dark at the edges it is usual to 
zero pad both the data y and recovered image x before calculating the FFT [30] ■ 
Neither of these applications of the FFT (zero padding or not) gives a com¬ 
plete model of the forward map as neither includes the influence of bright pixels 
outside the image region. It is natural in the Bayesian formulation to include 
and marginalize over these ‘nuisance’ pixels, which we do in the second com¬ 
puted example. 


2 Solution by Regularization 


We regularize the inverse of A by penalizing images that are not smooth in the 
sense that a pixel differs from the average of its neighbors, as in [37]. Define 
the neighborhood structure using the usual square pixel lattice; the neighbors 
of pixel i are the pixel locations that are above, below, to the left, and to the 
right of pixel location i. 

Write j ^ i when pixel j is a neighbor of pixel i, and dt for the set of 
neighbors of pixel f, i.e. 

di = {j ^ i\j ~ i} ■ 

Denote by |9i| the number of neighbors of pixel i; \di\ =4 for all pixels when 
periodic boundary conditions are assumed. Define matrix L to be 


Lij — < 


-1 

0 


i=j 

j e di 

otherwise 


( 2 ) 


which is the graph Laplacian on the neighborhood graph. The action of L is 
equivalent to convolution with the 5-point finite-difference stencil [7], i.e., 


Lx = X * 



-1 
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In the periodic case L is also diagonalized by the discrete Fourier transform 
with efficient calculation possible using the FFT. Even without that assumption, 
sparsity of L allows efficient operation, also requiring only 0(n) operations. 

The regularized estimate for the deblurred image is defined by 

Xx = aigm\n\\Ax — y\\^ + \x^ Lx (3) 

X 

for regularization parameter A > 0. The minimizer xx may be calculated by 
solving the generalized deconvolution equations 

{A^A + \L)xx = A^y. (4) 

With periodic boundary conditions, this linear solve requires 0{n) operations 
in the transform domain, with one-off O(nlogn) computing costs when trans¬ 
forming between image and transform domains using the FFT. Without this 
simplifying assumption, efficient solution of the system in (j3]) may be performed 
using a linear solver that exploits the sparsity of A + XL). 

As noted in [i], —h~^L is the discrete Laplacian, where h is the pixel spacing. 
Since h oc 1/p then nL —cV^ (with appropriate boundary conditions) for 
some constant c, as n —>■ oo. The transformations L •<— nL and A ■<— A/n leave 
all terms in the right-hand side of (|31) unaltered, so, for finite n, the use of L 
or the negative discrete Laplacian defines identical sets of deblurred images, 
though with altered value of the regularization parameter. We have used the 
graph Laplacian, as in other computational work |37| . to minimize roundoff 
errors. 

We used the L-curve criterion to select A, using Hansen’s l_corner .m al¬ 
gorithm in regtools [15] that takes 200 solves of (|4|) to find the ‘corner’ of the L- 
curve. By Parseval’s theorem the data misfit and regularization semi-norm may 
be computed in the transform domain, so each solve requires 0(n) operations. 
For our Jupiter example we found a regularization parameter A = 5.6724 x 10“^ 
in 0.517 seconds (including 0.507 seconds for 200 solves) in MATLAB R2012b 
using a Lenovo X230 laptop with an Intel CORE i5 processor. The computed 
deblurred image and L-curve are in Eigure[2j In the L-curve figure, the cluster 
of crosses correspond to sampled images (more on this later). 

In the deblurred image of Jupiter in Figure [2| we see artificial ‘ringing’ phe¬ 
nomena around the satellite and Jupiter. The ringing effect is reduced with 
better modeling of the point-spread function, though damping of the solution 
and some ringing is a typical feature of regularized solutions. 


3 Bayesian modeling and inference 

Bayesian modeling naturally proceeds in a hierarchical manner and allows, one 
could say requires, the specification of distributions over unknown quantities. 
The Bayesian inferential framework accommodates very general models for the 
unknown latent field, allowing representations in arbitrary measure spaces, often 
drawing on ideas from spatial statistics or pattern theory [14]. Here we 
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Figure 2: Left; the regularization estimate of a deconvolved Jupiter computed 
with A = 5.6724 x 10“^. Right; the corresponding L-curve. The tight cluster of 
crosses correspond to sampled images. 


restrict representations to the low-level image model used in the regularized 
solution and reuse model components and make distributional choices to enable 
a comparison of methods that are closest equivalents. 

3.1 Hierarchical stochastic model 

Observed data y is related to the unknown true image x via the observation 
model in O- We assume that the components of the noise vector rj in o 
are independent and identically distributed as a zero mean Gaussian with some 
variance 7 ~^, i.e., rj ^ N( 0 , 7 “^/) where / is the identity matrix. Such a 
model is common in settings where knowledge of instrumentation justifies the 
assumption that errors on individual measurements are independent, with zero 
mean, but with unknown average amplitud(3. Then the distribution over y 
conditioned on image x and precision 7 is 

y\x,-f {Ax,j~^l) . (5) 

We are able to prefer smoothness in the unknown ‘true’ image x by modeling 
a: as a draw from a Gaussian Markov random field (GMRF) that assigns low 
probability to non-smooth images. We use a locally linear GMRF m defined 
by the conditional distributions 

Xi\xo, N 

^Studies in inverse problems that assume m —)• oo have an infinite set of samples of the 
noise^ because the forward map is effectively finite rank. Hence, sample estimates of almost 
any statistic of the noise, including 7, have zero variance; this is one origin of the reducibility 
of the Gibbs sampler noted in that does not occur in the practical case considered here. 
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in which S is an unknown lumping constant. This models each pixel value as a 
Gaussian random variable with mean equal to the average of neighboring pixel 
values, and with some variance controlled by a lumping constant S. Thus, the 
preferred value for each pixel is the average of its neighbors, while the lumping 
constant controls how strongly that preference is asserted. The joint distribution 
over the vector x for given S is then the (intrinsic) multivariate Gaussian 

a;|(j-N(0,(dL)-i) (6) 

where the matrix L is the discrete Laplacian defined in ([J). When assuming 
periodic boundary conditions L has a zero eigenvector (the constant vector) 
and so the prior is improper. However, the posterior is normalizable so all 
marginal and conditional distributions are well dehned. It is usual to regularize 
the inverse in dG]) by adding a small constant (‘a nugget’) to the diagonal of 
L [T5]. Alternatively, we may define L~^ to be the Moore-Penrose inverse of L 
in which case the constant vector is also a zero eigenvalue of the covariance. 

The two parameters 7 and 6 are assumed unknown and so we must also 
specify distributions over these as well. For the present we will define the vector 
parameter 9 = ( 7 ,( 5 ) and simply write the joint distribution as tt{9). 

Combining the stochastic models (O, ([51), and for 6, we may write the linear- 
Gaussian Bayesian model in the slightly more general form 


y\x,9 - 

- N(Aa;,E( 6 »)) 

(7a) 

x\9 '' 

N(^,Q-i(d)) 

(7b) 

9 - 

- 7r(6»). 

(7c) 


This hierarchical stochastic model occurs commonly in statistics [34], in which 
y is observed data, a; is a latent field with mean y, and 0 is a vector of hyperpa¬ 
rameters that model uncertainties in the measurement noise covariance S and 
in modeling of the precision (inverse of covariance) matrix Q of the latent field 
process. In the language of Bayesian analysis, (l7a)) defines the likelihood function 
for unknown x and 9 once data y is observed, (129 is the prior distribution over 
latent held x with hyperparameters 9, and © sets the hyperprior distribution 
over those hyperparameters. 

The prior distribution used in (O has played a role in some of the earliest 
developments in Bayesian statistics. While a comprehensive discussion on (hy- 
per)prior distributions is well outside the scope of this paper, it is interesting to 
note a few details that are relevant to our study. 

The parameter i5 in ([ 6 |) is a positive scale parameter, whose numerical value 
depends on the units chosen for the spacing between pixels; a change of units 
corresponds to the transformations L ^ cL and 6 ■<— S/c, for some c > 0. 
Jeffreys [20] addressed the question of how to set the functional form of 7r(S) so 
that inference over x is independent of the choice of units, and developed what 
is now called the Jeffereys scale prioi0 that gives identieal marginal posterior 

^Jeffreys priors now form a general class of prior distributions (see, e.g. ED). often called 
reference priors, with the modern development under the moniker of objective priors [^. 



distributions for x\y, whatever the scaling of <5. Hence, the Jeffreys scale prior 
is often viewed as being uninformative about the units of <5 since inference on 
X is independent of that choice. In the Jupiter problem, the difference between 
using the graph Laplacian L or the negative discrete Laplacian nL, for finite n, 
is a scaling of 6 by n, just as we observed for scaling A in Section [51 Hence, the 
Jeffreys scale prior produces inference for x that is identical whether L or nL is 
used to define the prior precision in (|6l). 

Bardsley [1] employed a conjugate prior distribution over 5 to enable Gibbs 
sampling, and observed ([H §4.4]) that posterior inference differed when using 
L versus nL. From the perspective of Jeffreys priors this difference is a conse¬ 
quence of the conjugate prior being informative with respect to the scaling of 
(5, including units. The same choice and problem occurs in [2]. 

Simply using the Jeffreys prior in high dimensional settings, such as inverse 
problems, can lead to significant unintended biases EH; see also [3S1 [53] for ex¬ 
amples and resolution. We consider a very positive feature of the MTC sampler 
developed here is that it can operate with any hyperprior distribution tt{9) (that 
can be evaluated) and so the algorithm does not prejudice those considerations. 


3.2 Posterior inference 


The focus of inference is the posterior distribution over unknowns x and 6 con¬ 
ditioned on measured y, given by Bayes’ rule as 


T^[y) 


( 8 ) 


Note that we are performing the standard abuse of notation by using the symbol 
TT to denote any probability density function, and associated distributions, with 
the particular function determined implicitly by the arguments. Solutions and 
uncertainties may be computed as the posterior expectation of some function h 
of x, 

^x,d\y[h{x)] = j hi^x)!: {x,9\y) dxd9 

which implicitly averages over the nuisance parameter 9. Sample-based methods 
use a Monte Carlo estimate of the integral. When (x, 0)^^^,..., (x, 0)*^^^ ^ 
TT (x, 9\y) are iterates of an ergodic Markov chain. 
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E^,e\y[hix)] ft! 

with convergence guaranteed by a central limit theorem ED- 

For computation, it is important to observe that the numerator in ([5]) 


TT {y\x, 9) TT (x, 9) =TT {y\x, 9) tt {x\9) tt (9) 


may be evaluated as the product of the three density functions in the hierarchical 
model ([7]). However the normalizing constant 




J J tt {y,x,9) dxd9 
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is typically not available in the sense that it is infeasible to compute. We 
will assume throughout that 7r(y) is finite, and non-zero. Hence the posterior 
density may be evaluated up to an unknown constant, and therefore can be ex¬ 
plored using Metropolis-Hastings MCMC m- Indeed this is the most common 
method of sampling from the posterior distribution being easy to implement, 
typically employing a random-walk proposal distribution. Examples of such 
methods include the adaptive Metropolis (AM), Metropolis-adjusted Langevin 
(MALA) and hybrid (or Hamiltonian) Monte Carlo (HMC) methods, amongst 
many others. Random-walk Metropolis-Hastings MCMC methods on the full 
state space are typically very slow to converge, not uncommonly requiring 10"^ to 
10 ® iterations for a single independent sample mm. with each iteration requir¬ 
ing simulation of high-dimensional data over a high-dimensional image space. 
We do not implement such a calculation here as the marginal then conditional 
algorithm we present next is several orders of magnitude cheaper. 

3.3 Marginal then conditional sampling 

We propose to significantly speed up sampling by first sampling from the marginal 
posterior distribution over hyperparameters 0 

^ i^\y) = y ^ 

then from the full conditional distribution over x, to give Algorithm [TJ that we 
call marginal then conditional (MTC) sampling. In statistics, this algorithm 
is widely known as the marginal algorithm, see e.g. [5S1 [55]. We prefer the 
more descriptive term, and use it to refer to both the decomposition and the 
computational scheme we present later. 


Algorithm 1: MTC sampling from the posterior distribution 
draw 9 ^ 7r(6»|y) 
draw X ^ TT (x|y, 9) 


Lemma 1. Algorithm\J\ generates a sample from the posterior distribution, i.e., 

{x,9) -7r(x,6»|y). 

Proof. The density function over x and 0 is tt {x\y, 9) tt (0|j/) = tt (x, 9\y). □ 

When the samples over tt {9\y) in Algorithm [T] are independent, then so are 
the posterior samples (x,9). The case where 9 ^ tt {9\y) is generated by one 
step of a (geometrically ergodic) MCMC was considered by Acosta, Huber & 
Jones [I] who called Algorithm [1] a ‘linchpin variable sampler’ with 9 being the 
‘linchpin variable’. They showed that the convergence rate of the chain in (x, 9) 
is the same as the chain in 9. Later, we will use a MCMC to draw samples 
9 ~ 7r(0|y); however, that iteration is sufficiently fast that we will take many 
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steps of the MCMC to generate an effectively independent 9 before drawing 
X ~ T:{x\y,9). The observation that independent 0 gives independent (x, 0) is 
a degenerate form of the result in [T]. 

3.3.1 Marginal posterior for 6 


Lemma 2. 

/ai ^ Tr{y\9,x)Tr{x\9)n{0) 

n{x\0,y)TT{y) 

Proof. n{x,y,9) = tt (x|0, y) 7r(y|0)7r (0) and TT(x^y,9) = TT{y\x,0)'!T{x\9)TT(9). 
Writing Tr{y\9)Tr{6) = Tr{6\y)Tr{y) and using 7r(y) 0, the result follows. □ 


Since tt (y) does not depend on 9, it follows that 


TT {9\y) oc 


TT {y\9,x) 7r(x|d)7r (0) 

7r(x|0,y) 


(9) 


This result is given for Gaussian distributions in [33], though it holds more 
generally as ([31) shows. In principle the right-hand side of (jH]) may be evaluated 
using any value of x for which tt (x|0, y) is appreciable enough to avoid round-off 
issues. In the linear-Gaussian case 0 we can eliminate x to give 


■(^|y) 


det(E-i)det(Q) 


det(Q -f A'^E-^A) 

exp [S"^ - E~'^A{A'^E~^A + Q)~^A'^E~^] (y - A/r)| 7r(0). 


( 10 ) 


This reduction appears to have been overlooked in [33]. Note that this distri¬ 
bution is not Gaussian because of the dependeirce of E and Q on 0 (that is not 
shown for brevity), and that arbitrary hyperprior distribution 7r(0) is allowed. 
For our Jupiter example this formula simplifies to 

7r(0|y) oc exp (A) - (A)^ 7r(0) (II) 

where A = <5/7, and the univariate functions / and g are 

/(A) = y^y - (A'^yfiA^A + \L)-\A^y) (12) 

y(A) = logdet -I- at) . (13) 

Both these functions are analytic, monotonic, and very mildly behaved over a 
wide range of arguments, as can be seen in Figure [3] (periodic case). One would 
therefore expect that efficient computation is possible. We develop efficient 
calculation of the (difference of) functions / and y, and sampling from 7r(0|y), in 
SectionOfor periodic boundary conditions, and for the general case in SectionjO] 
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Figure 3: Functions /(A) and g{X) for the periodic p x p image model 


3.3.2 Full conditional for x 

For the linear-Gaussian hierarchical model o, the full conditional for x may 
be readily determined [HIM] as the multivariate normal distribution 

a;|y,6» - N (14) 

where 

Mv.o = + 

Qx\y,e = Q + ^A. 

Again, we have omitted the dependence of matrices on 6 for brevity. 

For the Jupiter example we have 

x\e, ~ N {[A^A + {6h)L)~^A^y, (yA^A + 5L)~^) . (15) 

An independent sample from this distribution may be computed by solving 

(yA^'^A + 5L) X = 'yA'^y + w (16) 

where w = vi + V 2 with independent m ~ N (0,yA'^A) and U 2 ~ N {0,SL). 

In the periodic case, the FFT diagonalizes all matrices so the cost of sam¬ 
pling the random variables and solving ()16|) is 0{n) operations in the transform 
domain with a further C>(nlogn) operations for each FFT. In the general case 
the cost of sampling random variables remains at 0{n) operations since the co- 
variance for vi is factorized and A is sparse, while the neighborhood definition 
of L which is the covariance for V 2 allows it to be written as a sum over cliques 
of 2 X 2 rank-1 matrices; we call the resulting 0{n) sampler for V 2 assembly by 
cliques because it uses the same decomposition as assembly by elements of a 
finite-element method stiffness matrix. Hence, in both cases, the cost of sam¬ 
pling x\6,y is dominated by the cost of solving (IT6)) . which is precisely the same 
linear solve required in the generalized deconvolution step (j4|). 
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We have written in the form given by Bardsley [3], who called the 
method randomize then optimize. The calculation in (HU had been previ¬ 
ously used, to our knowledge, by Oliver, He and Reynolds in 1996 [55] un¬ 
der the moniker randomized maximum likelihood and more recently in m as 
perturbation-optimization. This formulation for sampling from the full condi¬ 
tional for the latent field was also used by Wikle et al. [31] who solved the system 
using early termination of a conjugate gradient solver, noting that the quality 
of the approximate sample could be controlled by the convergence criterion. 
Inexact solution of ([H followed by a Metropolis-Hastings accept/reject step 
has also been considered; m used early termination of a conjugate gradient 
solver, while [25] established convergence properties when using a finite number 
of steps of a linear iterative solver. 

4 Comparative description of sampling algorithms 

We present two sampling algorithms in addition to the MTC algorithm. The first 
is a block Gibbs sampler [lasKiT] that has been presented as an efficient way 
to sample from high-dimensional linear-Gaussian inverse problems. The second 
is the ‘one-block’ algorithm introduced by Rue and Held [33] that almost always 
improves on the block Gibbs sampler in statistical efficiency while having similar 
cost per iteration. Finally we present the MTG algorithm that has the same 
statistical efficiency as the one-block algorithm but with lower computational 
cost per iteration. 

4.1 Block Gibbs sampler 

Random-walk MGMG sampling of the posterior distribution suffers from slow 
mixing due to high correlations within the distribution over images, and be¬ 
tween the image and hyperparameter. Slow mixing within sampling of the 
high-dimensional image may be alleviated by updating the unknown image in 
a single Gibbs step, i.e., treating the image components in a single block. This 
is feasible because the full conditional distribution over image x, given every¬ 
thing else, is the multivariate normal (fHl) or m so independent samples may 
be drawn using efficient methods from numerical linear algebra for solving the 
system m- 

Block Gibbs sampling proceeds by drawing from the full conditional distri¬ 
butions over image x then hyperpaprameters 7 and 6, repeatedly in sequence. 
Hence it is also necessary to have the full conditional distributions for ^\x,5^y 
and 5\x,^^y available in a form that can be sampled. This is possible when 
using conjugate prior distributions over the hyperparameters. For the Jupiter 
example we follow and use the Gamma distributions 7 ^ and 

5 rias^Ps), i-e.. 


7r(7) ^ iexp(-/3.^7) 

7r((5) ~ (5“'*“^ exp(— 


(17) 

(18) 
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in which a-y, P^,as, Ps are constants chosen to make the hyperprior distributions 
“relatively uninformative” [4]. We use = as = 1 and = Ps = 10“^ as 
in [ 3 ], although this hyperprior distribution can be viewed as informative for 
the scale of S, as mentioned above. The resulting conditional distributions over 
hyperparameters are 



l\x,S,y 

S\x,'y,y 


(19) 

( 20 ) 


The block Gibbs sampler is the only sampler we present that requires a 
mathematically convenient form for the hyperprior distribution. We see this as 
a major disadvantage of (block) Gibbs sampling, for the reasons given at the end 
of Section[3] Despite that objection, we will use the hyperprior distributions (I17|) 
and (1181) in all sampling algorithms to enable exact comparisons. 

A block Gibbs sampler may then be implemented by cycling through sam¬ 
pling from the conditional distributions in (fTSl) . (IT^ and (OUl) to get Algorithm 
[21 as implemented in [HfTB]. Most of the computational cost per iteration is 


Algorithm 2: Gibbs sampling algorithm with blocking of the latent field 
at state x, 9 = ( 7 , 6) 

draw xjy, 5, y ^ N (^{A^A + (S/j)L)~^A'^y, (yA’^A -|- SL)~^) 
draw j\x,S,y ^ r(f -f A^, 5 ||Ax - yf+Pj) 
draw S\x,'y,y r(§ -I- A 5 , 5 ||Aa; - y||^ -I- Ps) 


contained in the draw from the large Gaussian latent field, since that requires 
a solve of (ITBl) . 

In practical inverse problems, it is found that the block Gibbs sampler re¬ 
quires about 10 ^ to 10 ^ iterations per effectively independent sample |H H^[TT| . 
This is about 2 orders of magnitude improvement over naive random-walk 
MGMG directly on the posterior distribution. 

It is well known that the statistical efficiency of (block) Gibbs sampling is 
dependent on parameterization and that the rate of convergence may be im¬ 
proved with an appropriate re-parameterization, see e.g. [29) . However, compu¬ 
tational efficiency in the Jupiter example is not necessarily improved since re- 
parametrization will, in general, require three linear solves per iteration rather 
than one, increasing the cost per iteration. Recent results also show that re- 
parametrization can lead to dimension independent mixing [2]. This does not 
imply dimension independent computational cost since the block Gibbs sam¬ 
pler remains a geometrically convergent algorithm that requires at least one 
linear solve per iteration, whose computational cost increases with dimension. 
In contrast, we will find that the MTG sampler requires just one linear solve 
per independent sample, beyond a fixed setup phase. 
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4.2 One-block sampler 

The one-block algorithm [531 §4-1.2] is usually feasible whenever the calcula¬ 
tions required for the block Gibbs sampler are feasible. Further, the one-block 
sampler almost always has better statistical efficiency than Gibbs, including 
after re-parametrization, and does not require a special form for the hyper¬ 
prior distribution, so should be preferred over the block Gibbs sampler in most 
circumstances. 

The one-block algorithm is so named because the hyperparameter 6 and 
latent field x are blocked together within a single Metropolis-Hastings accept- 
reject step. In this scheme a candidate hyperparameter O' is drawn according 
to some proposal distribution q{0'\O)^ typically a random-walk, then (fT4l) is 
utilized to draw x' conditioned on O' and y. The composite proposal {x',0') is 
then accepted with probability (w.p.) given by the usual Metropolis-Hastings 
rule on the joint posterior distribution. The effective proposal distribution for 
the composite proposal is then 

q {x', 0'\x, 0) =Tr {x'\0', y) q {0'\0 ), 
and the one-block algorithm may be written as Algorithm |3| 


Algorithm 3: One-block algorithm 


at state x, 0 
draw O' - g(6»'|6») 
draw x' ~ TT [x'\0', y) 

accept {x',0') w.p. a{{x,0) {x',0')) = 1 A 

otherwise reject 


Tr{x',O'\y)Tr{x\O,y)q{0\0') 

TT{x,0\y)Tr{x'\0',y)q{0'\0) 


Lemma 3. In Algorithm\^ the transition kernel for the hyperparameter 0 is in 
detailed balance with the marginal posterior distribution for 0\y. 

Proof. 

TT (a;, 6»|?/) = 7r(x|6i,y)7r(6»|?/) 

so the Metropolis-Hastings ratio is (assuming no densities are zero) 

Tr{x',0'\y)Tr{x\0,y)q{0\0') ^ tt {x'\0',y) tt {0'\y) tt {x\ 0, y) q {0\0') 

TT (x, 0\y) TT (x'|6»', y) q {0'\0) tt (x|6», y) tt {0\y) tt ix'\0', y) q {0'\0) 

^ T^{d'\y)q{0\0') 

T:{0\y)q{0'\0) ' 

□ 

Thus, the chain in 0 targets 7r(0|?/), as if we have been able to integrate 
out the latent field. Because this chain makes steps in the marginal posterior 
distribution for 0, rather than the conditional for 0 given the current x, it takes 
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larger steps and the chain in x, 6 converges more rapidly to the joint posterior 
distribution (see [29] noting that in ill-posed inverse problems the data y is 
necessarily ‘weakly informative’ in many dimensions of x). In particular, high 
correlation between hyperparameters and the image, that partially motivates re- 
parametrization of the Gibbs sampler, is irrelevant to mixing of the one-block 
algorithm. Figure 4.1 of [33] illustrates this. 

Evaluation of the acceptance probability in Algorithm requires evaluating 
a ratio of determinants. The ratio of posterior distributions typically does not 
present difficulties since, as in the Jupiter example, the scaling of determinants 
with respect to hyperparameters is simple. However, the ratio of determinants 
associated with the full conditional for x in the proposal presents the traditional 
difficulty. In work by Rue and colleagues (e.g. [33l|34]) it is assumed that sam¬ 
pling from a multivariate normal is performed using Cholesky factorization and 
hence the required determinants are available at a further cost of n multiplica¬ 
tions. However, in very large problems computing the Cholesky factorization 
is prohibitively expensive and in general iterative solvers of ([m are most effi¬ 
cient. Then the required determinants are not directly available. In Section [3] 
we present an efficient method for calculating the required ratio of determinants 
when using iterative solvers for large problems. 

4.3 MTC sampler 

We implement the MTC sampler in Algorithm [1] by performing Metropolis- 
Hastings MCMC sampling directly from 7r(0|j/), shown in Algorithm and 
only after obtaining an (effectively) independent sample 9 tt{ 6\y) do we then 
draw X ^ 7r(a;|0,y) to get an (effectively) independent sample {x,d) ^ TT{x,9\y) 
from the full posterior distribution. If we use the same proposal distribution 


Algorithm 4: Metropolis-Hastings algorithm on T^{9\y) 
at state 0 
draw 9' - g(6»'|6») 


accept 9' w.p. a{9 
otherwise reject 


9') = 1A 


TT{9'\y)q{9\9') 

Tr{9\y)q{9'\9) 


q{9'\9) as the one-block Algorithm (3] then by Lemma [3] both MTC and one- 
block will generate the same chain over 9 and hence these algorithms have the 
same statistical efficiency. However, MTC evaluates the Metropolis-Hastings 
ratio directly using tt {9\y), and thus avoids the cost of the linear solve required 
to draw from the full conditional for x in each MCMC iteration. 

Reduced computational cost is possible when it is possible to cheaply eval¬ 
uate the ratio tt {9'\y )/tt {9\y) required in the Metropolis Hastings acceptance 
probability. For the general linear-Gaussian model © this involves evaluating 
ratios of determinants of Q and Q -I- in CQl), and differences of 

arguments of the exponential (which are also required in the one-block algo- 
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rithm). Efficient calculation of these terms is developed in Sections [S] (periodic 
boundary conditions) and [ 6 ] (general case). 


5 Numerical comparisons for the periodic model 

In this Section we develop specific computational schemes and present numerical 
results for all algorithms applied to the Jupiter deblurring problem, in the sim¬ 
plified setting when periodic boundary conditions are assumed for the unknown 
image of size pxp, as used for regularized inversion computed in Sectionj^l This 
allows us to use the computational cost of regularized Fourier deconvolution as 
a benchmark, being a standard efficient method for image deblurring |30j . The 
EFT also transforms determinants into products to give, at worst, an 0{n) 
calculation. 


5.1 MCMC sampling from 7r{9\y) 


We present two algorithms for the MCMC over 9\y in the MTC sampler: the 
first (Option 1) is a ‘no think’ implementation using a random walk MCMC 
and the 0{n) evaluation of determinants; the second (Option 2) implements 
Metropolis-within-Gibbs over a re-parametrization of 6 that is more efficient 
and uses the efficient calculation of functions / and g detailed in the Appendix 
that is potentially 0(1) in image size, provided that the number of terms XZi 
in the interval [c, c“^] is 0{1) (see the Appendix). Then the on-line cost of 
sampling from 7r(0|y) will be 0(1) as n —>■ oo, so the only on-line computation 
that depends on image size will be the single solve of (fT6)l that generates an 
independent image sample. 

Option 1. From current state 6 = ( 7 , J) propose 9' = (7^ JO according to 


7'|7 ~ N(7,w;2) 

J'|J ~ N(J,rc52), 


that defines the proposal density 



and proceed as in Algorithm |4l This MCMC simplifies to the Metropolis algo¬ 
rithm as the proposal density function is symmetric, hence the Hastings ratio 
q{9\9') /q{9'\9) always equals 1. The simple 0{n) calculation of determinants 
is used. 

A useful guide is to tune Wj and ws until the acceptance ratio is approx¬ 
imately 0.5 in low dimensions and 0.25 in high dimensions [32]. We used 
Wj = 2.34 X 10“^ and ws = 17.28 x 10“^, which correspond to approximately 
1.8 times the standard deviation of ^\y and J|y, respectively. 

Option 2. This option uses a more efficient technique for evaluating / and 
g and a Metropolis-within-Gibbs algorithm with bespoke Gibbs directions to 
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obtain a near-optimal implementation of MTC. Instead of sampling from 5\^, y 
and 7 |(I,y we use the polar coordinates (p = tan“^( 5 / 7 ) and r = and 

sample from r\(j),y and 4>\r,y. We do this because we are able to directly draw 
independent samples from r\(j),y since 




ri ? 


COS (b 

■ as, —^/(tan p) + cos (p + Ps sin p 


We then use one iteration of a Metropolis algorithm to sample from p\r,y that 
has density function 


TT{p\r,y) oc 7 r(^, 7 |y) oc 

(cos(^)“^“^(sin(/))"^^''‘“''“^ exp (—^g{\.a.np) 


r cos p 
2 


/(tan p) — PjT cos p — Psr sin p 


using the symmetric proposal 

q{p'\p) = N{p';p,wl) 

with W 2 = 10“^ chosen so that the acceptance rate is approximately 0.44, being 
the optimum acceptance rate in one-dimension when the target is Gaussian |32j . 
This defines the stochastic iteration for sampling 6 *|y in Algorithm [H 


Algorithm 5: Metropolis-within-Gibbs algorithm on 7 r( 0 |y) for the Jupiter 
example with directions p = tan“^(J/ 7 ) and r = 


at state 9 

draw r\p,y + as, ^^^f{ta.np) + cost/ -I- Ps sint/j 

draw p' ^ q{p'\p) 


accept p' w.p. a{p —>•(/') = 1 A 


T^{(p'\r, y)q{p\p') 

'^{.(PV,y)q{P'\P) 


otherwise reject 


Evaluation of the ratio of marginal densities uses the series expansion of 
functions / and g detailed in the Appendix. 

5.2 Numerical results 

To compare computational efficiencies we implemented the four options for 
sample-based inference, described above, applied to the same posterior dis¬ 
tribution, that is, using the conjugate distributions over the hyperparameters 
given by (HU) and dsni). We initialized each Markov chain at 7 = 0.218 and 
(5 = 5.15 X 10“^ which is the mode of the marginal distribution function TT{0\y) 
over 9, found using MATLAB’s fminsearch function. For the one block algo¬ 
rithm we also used initial x = A -\- 5L)~^^A^y and the same proposal as 
MTC Option 1 . 
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Figure 4: Left; image component of one posterior sample, calculated using the 
MTC Option 2. Right; marginal posterior histograms for 7 , S and S/j. 


In order to evaluate accurate statistics we computed chains of length 10000, 
then, by inspection, discarded a burn in of length 60 for block Gibbs and 20 
for the other algorithms. All computation was performed in MATLAB R2012b 
using a Lenovo X230 laptop with an Intel CORE i5 processor. We used the 
MATLAB’s fft 2 function to diagonalize the action of A and L, and the al¬ 
gorithms were implemented in the transform domain to reduce the number of 
FFTs required. 

Figured shows the image component of a single sample from the posterior 
distribution, and marginal histograms for the hyperparameters 7 , 6 and the 
effective regularization parameter S/j, using the MTC Option 2 algorithm. Note 
that all algorithms we use are provably convergent with the same distributional 
limit, so the particular algorithm we used is not actually important. 

A single posterior sample provides unbiased estimates of any property of the 
posterior image, in contrast to the regularized inverse that necessarily produces 
biased estimates. In this sense, a single posterior image sample could be viewed 
as superior to the regularized solution. The posterior mean image is more usually 
thought of as the Bayesian counterpart to the regularized inverse; we present a 
mean image in Section [S] 

We also calculated sample-based values of \\Ax — y\\ and Vx'^Lx and plot¬ 
ted these as crosses in Figure [2] to indicate the posterior distribution of these 
statistics. Note that the crosses are tightly clustered and lie away from the 
L-curve; no regularized estimate provides a good posterior summary of these 
statistict@. The posterior distribution over effective regularization parameter 

®This observation is not surprising when one notes that the regularized estimate in m 
may be written xx = argmince x^Lx subject to \\Ax — y|p = c for various c > 0. Hence all 
posterior samples lie above the L-curve. In the presence of noise one therefore expects the 
minimizer to display stochastic bias and to be an outlier in the posterior distribution for these 
statistics. We can conclude that optimization does not provide valid summary statistics of 
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Table 1: Compute times and CCES (in seconds) and acceptance rates. 



burn in 

total time 

acceptance rate 

CCES for A 

Block Gibbs 

60 

83.1 

1 

0.17 

One block 

20 

127.8 

0.33 

0.090 

MTG Option 1 

20 

63.1 

0.33 

0.050 

MTG Option 2 

20 

11.8 

0.46 

0.015 


( 5/7 has a sharp peak at 2.37 x 10 which differs significantly from the value 
suggested by the L-curve method. 

5.3 Computational efficiency 

We measure computational efficiency of algorithms by evaluating the computing 
cost per effective sample (CCES), defined as [11] 

tT 

CCES = — 

N 

where r is the integrated autocorrelation time (lACT) for the statistic of inter¬ 
est, T is the total (on-line) compute time, and N is the length of chain (10^ in 
our case). We use the definition 


r = l + 2^Pfc, (21) 

k=l 

where pk is the autocorrelation at lag fc, which gives the length of the chain 
that has the same variance reducing power as one independent sample. Thus, 
CCES is the compute time required to reduce variance in estimates by the same 
amount as one independent sample; smaller is better. We estimate r using 
twic^ the value computed by Wolff’s UWerr.m code [40| for each of 7 , 5, and 
A = 15 / 7 . Autocorrelation functions for A are shown in EigurejS] 

Total compute times, CCES, and lACT are shown for each sampling algo¬ 
rithm in Tables [U and |2j We see that statistical efficiency (lACT) and compu¬ 
tational efficiency (CCES) follow the pattern expected from theoretical consid¬ 
erations, with MTC Option 2 being clearly the most efficient in both measures. 
Option 2 improves on Option 1 in lACT because of the improved MCMC for 0\y 
while compute time per iteration is also reduced due to the efficient evaluation 
of /(A) and 5 (A). 

In Table H] block Gibbs has lACT for 5 much larger than for A, whereas 
the other algorithms have similar lACT for 5 and A. This suggests that high 

the Bayesian posterior distribution. 

^Two definitions of lACT are used in the literature, one being twice the other. Physics 
literature m tends to define lACT as whereas statistics literature m uses the definition 

in fSTJl. 
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Figure 5: Autocorrelation of A = 5/7 for all four sampling algorithms. 


Table 2: Integrated autocorrelation times (in iterations) for three statistics of 
interest. 



7 

5 

A = 5/7 

Block Gibbs 

1.6 

22.3 

21.0 

One block 

7.8 

6.7 

7.1 

MTC Option 1 

7.6 

7.8 

7.9 

MTC Option 2 

2.1 

5.0 

5.7 
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Table 3: Compute time and number of solves required for a regularized image 
or an independent image sample. 



time to A 

solves to A 

time to x 

solves to X 

total time 

Regularization 

0.52 

200 

0.0024 

1 

0.52 

Block Gibbs 

0.85 

102 

0.0020 

0 

0.85 

One block 

0.44 

34 

0.0020 

0 

0.44 

MTG Option 1 

0.23 

36 

0.0096 

1 

0.24 

MTG Option 2 

0.037 

0 

0.0082 

1 

0.045 


correlations between x, 7, and 6 reduce the efficiency of block Gibbs. Results 
for the block Gibbs sampler are shown without re-parametrization that could 
potentially improve its statistical efficiency. Looking at the lACT for A in Ta¬ 
ble [2] one would expect that better parametrization could reduce the lACT of 
block Gibbs in this example to that of MTG Option 2, since that algorithm is 
not affected by correlations between x and 6, and the re-parametrization of 9 
gives efhcient sampling. That is, statistical efficiency of block Gibbs could be 
improved by a little more than factor of 3. Since the re-parametrization of Gibbs 
increases the cost per iteration by a factor of about 3, re-parametrization would 
only slightly reduce the GOES for block Gibbs. This better parametrization of 
the hyperparameters could also be applied to the one-block algorithm to reduce 
its lAGT by a small amount. Thus, we expect that after a re-parametrization 
of both algorithms, the one-block algorithm will remain roughly twice as com¬ 
putationally efficient as block Gibbs sampling, as shown in Table [TJ This agrees 
with the suggestion in [33] that the one-block algorithm is to be preferred over 
Gibbs in most circumstances. 

In Table ID we also see that lAGT for both the one-block algorithm and 
MTG Option 1 are approximately the same which agrees with theory that the 
algorithms are statistically equivalent. MTG Option 1 has a smaller GOES than 
does one-block because evaluating the acceptance ratio for MTG Option 1 has 
approximately half the computing cost of the one-block algorithm. 

Finally, in Table |3I we compare sampling to regularization. Time to X is the 
compute time for finding A by constructing an L-curve for regularization, and 
is 

^((burn) + 2T) 

for sampling algorithms. Time to x is any additional compute time required to 
produce a deconvolved image. For regularization this is the time to solve (|3|), 
for the one-block algorithm it is the cost of a single inverse EFT, and for the 
MTG algorithms it is the cost of drawing a sample from x\y,9 by solving (ITS)) . 

MTG Option 2 is a whole order of magnitude faster than regularization at 
computing a candidate for the deconvolved image of Jupiter. Regularization 
spends the majority of computing time constructing an L-curve to find a suit- 
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able regularization parameter, requiring 200 solves, and then solves the linear 
system o to estimate the deconvolved Jupiter by the MAP estimate oi x\y, 9. 
MTC Option 2 is an order of magnitude faster at computing A = J/y as an 
independent draw from 7r(0|?/), and then draws an independent sample from 
x\y,9 by a single solve of (fTO . 

We also show the number of solves to A, which is ((burn) + 2t) for the block 
Gibbs, one-block and MTC Option 1 algorithms, and the subsequent solves to x 
for each algorithm. Since each algorithm is dominated by the cost of the linear 
solves, these figures give relative compute costs in the general case when linear 
solves are more expensive than 0 {n). 

6 MTC sampling for non-periodic models 

We now present an implementation of MTC sampling for deblurring Jupiter 
using an image model that includes a 16 pixel wide band of nuisance pixels 
around the image region and without the simplifying assumption of periodic 
boundary conditions. We will see that, after an initial off-line computing phase, 
samples from 7r(0|y) can be computed cheaply and independent of the image 
size. 

Let X € M" where n = (256-1-32)^ be the true image augmented with a border 
of 16 pixels, and impose zero Dirichlet boundary conditions beyond that. We 
compute the action of A, A'^, and L directly by convolution using MATLAB’s 
conv2 function. Since A € and L £ are sparse matrices, operation 

by these matrices requires 0 (n) operations but we never assemble or factorize 
the matrices. 

The marginal posterior distribution 7r(0|?/) is given by dill) and we use con¬ 
jugate prior distributions in (fT71) and (fTSl) . as before. To sample from 7r(0|j/) we 
implemented a Metropolis-within-Gibbs algorithm with Gibbs directions 7 and 
A = S/j. We can directly draw independent samples of 7|A,y since 

/ 777. 1 x 

7|A, y"^r(-^-|-Q!5-|- a^, + 1^7 + /^s^J , 

and we use the random walk Metropolis algorithm with proposal A'|A ^ A^(A, rc|) 
with W 3 = 10“^ to sample from Ajy,?/ which has conditional density function 

7r(A|7, y) oc exp ^-^5(A) - ^/(A) - ^^yA^ . 

Efficient implementation of this algorithm then depends on the ability to effi¬ 
ciently evaluate /(A) (for the Gibbs step), /(A') — /(A), and g(A') — ff(A). We 
found that quartic Taylor series expansions of / and g, beyond the zeroth term, 
about the mode Aq = argmax^ 7r(A,7|y), gave sufhciently accurate results for 
the present example. 

Writing B = A'^A -|- AqL, the derivatives of / are 
f(-)(Ao) = (-ir+^k!(A^gf(B-^LrB-\A^y), r=l, 2 ,.... ( 22 ) 
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Using the identity [131 P-29] 


log(det(/ + tF)) = -tr(F")r 

r=l 


the derivatives of g are 

g«(Ao) = (-ir+4r((B-iLr), r = l,2,.... (23) 

We evaluated Monte Carlo estimates of each trace in (E31) by exploiting the 

identity tr((i?“^U)'’) = E[z^(i3“^L)'’z] where each Zi Unif ({—1,1}), see e.g. 
|221 §6]. This calculation makes even order Taylor expansion most convenient 
as the compute work required to evaluate an odd derivative allows the next even 
derivative to be evaluated for free. The accuracy of Monte Carlo estimates and 
the number of terms in the Taylor series may be determined so that Monte Carlo 
and truncation errors are smaller than the relative error inherent in performing 
the linear solve in finite precision, though this gives a conservative bound. 

We used MATLAB’s gmres solver function for each linear solve, restarted 
every 25 iterations, with relative residual tolerance of 10“^. A tighter tolerance 
resulted in very long solve times when A is small. The condition number of B 
is approximately 10^. We found that just 4 samples of z gave sufhciently small 
relative Monte Carlo error. Hence, to obtain the quartic Taylor expansions of 
/ required 3 linear solves, and g{X) — g(Ao) required 4 x 4 = 16 linear solves. 

We used an iterative procedure to find the mode of 7r(7,A|?/), by computing 
quartic Taylor expansions of / and ^(A) — g{X^‘^^) about A^’’^ then using these 
expansions in place of / and g to obtain = arg max^ 7r(A|7, y)/' k{X^'''> I7, y). 

We terminated this algorithm when — A^’’^|/|A^’’^| < 10“^. Starting from 

A*.°^ = 5 X 10“^ this required 3 iterations and 57 linear solves to converge to 
Ao = 4.4879 X 10“^. We tested convergence by also starting from A^°) < Aq and 
found the algorithm converged to the same Aq. 

After computing quartic Taylor expansions of / and g, the cost of computing 
a Markov chain of length 10"^ was negligible compared to a linear solve. Note 
that block Gibbs and the one-block algorithm (not simulated here) both require 
a linear solve per sample, so would require 10^ solves to compute a Markov 
chain of the same length. The 74 = 57 -I- 3 -I- 16 linear solves required before 
computing the Markov chain for MTC is comparable with, though smaller than, 
the 200 solves required to determine A for regularization. 

Figure m shows histograms of 7, d and X = 6/j values computed from the 
MTC sampler. We also computed the marginal posterior mean image, shown 
in Figure [HI as follows. The posterior expectation of any function h{x) may be 
written 

^x,0\y (^)] ^0\y (^)]] 

which is a weighted sum in 9 of expectations over full conditionals in x. This 
allows efficient calculation when the inner expectation y [h (x)] is cheap to 
evaluate; the outer expectation is low dimensional and may be easily estimated 
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Figure 6: Left; mean of posterior image with non-periodic boundary conditions. 
Right; marginal posterior histograms for 7, 6 and S/j. 


via a Monte Carlo integral utilizing samples from the MCMC over 6 \y, or eval¬ 
uated via numerical integration once the marginal posterior distribution over 9 
is well determined. We implemented the latter. In the linear Gaussian problem 
the full conditional for x is Gaussian so any moment may be evaluated this way, 
i.e. for polynomial h. 

The mean in the Jupiter example further simplifies to 

E[a;|y] = JXL)~^A'^yTr{X\y)dX (24) 

with weights for the numerical integration given by the marginal posterior his¬ 
togram for A. This requires as many linear solves as there are bins required for 
the histogram for A. Note that the integration in (l24l) requires the solve in (|4]) 
rather than the solve in (jl6l) that also requires drawing random numbers. 

It is interesting to note that the posterior mean image shown in Figure [6] 
gives a better deblurred image than the regularized solution in Figure^ with the 
bands of Jupiter, and other details, being more clearly defined. This improve¬ 
ment may be the result of better modeling of the image boundary employed 
in this example. Artificial ringing around the satellite remains evident and, as 
for regularization, reduction of this can be achieved by better modeling of the 
point-spread function. The marginal posterior variance of images may be eval¬ 
uated to provide valid uncertainties, as indicated above, though we have not 
made that calculation for this example. 

The posterior histogram for A in Figure |6] shows that A is effectively sup¬ 
ported on |A — Ao| < 3 X 10“^. Since f^^\Xo) = 5.04 x 10^® and g^^\Xo) = 
4.2 X 10^®, the truncation error for / and g is approximately 3 x I0“®. We also 
evaluated the Monte Garlo error in estimates of the trace for each derivative of 
g. This was 9.7 x 10® for g'(Ao) and 2.4 x 10® for g"(Ao). Hence, the error in 
the linear term for g{X) — g{Xo) is bounded by approximately 3 and the error in 
the quadratic term is approximately 10“®. The errors in the cubic and quartic 
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terms are 10“^ and 10“^, respectively. Small relative errors in the linear terms 
correspond to a small relative scaling in the hyperparameters, which does not 
affect inference in x when hyperpriors are uninformative in this scale near Aq. 
Thus the dominant error is the quadratic term of 10“^, which corresponds to 
a small relative error in the variance of the marginal posterior distributions 
over 5 and 7 . Using multiple centers in a piecewise expansion provides a simple 
mechanism for reducing the effect of errors to any desired level. 

7 Discussion 

We considered posterior sampling for a canonical linear inverse problem with 
n = 65536 and n = 82944 unknowns using a Bayesian hierarchical model with 
Gaussian likelihood and GMRF prior. In the computed example we used a 
conjugate hyperprior distribution to enable block Gibbs sampling. By using the 
observation in [ 1 ] that sampling from the full conditional distribution over the 
latent field may be performed by a regularized solve (plus samples from standard 
normals), we focused effort on sampling from the marginal posterior distribution 
over hyperparameters. Our main contribution is to show that an MGMC over 
that marginal distribution can be cheap and fast, by reducing the computation 
within iterations to the evaluation of two smooth functions of a single variable, 
to give an algorithm that we call marginal then conditional (MTG) sampling. 

In the first computed example we compared the computational cost of the 
MTG sampler with the cost of regularized Fourier inversion and also a range 
of MGMC sampling algorithms that have been applied to the linear-Gaussian 
problem; the MTG sampler outperforms all these algorithms in the sense that 
the compute time required to generate an independent sample from the posterior 
is less than required for the alternative MGMC algorithms, and also an order 
of magnitude less than the cost of regularized inversion. The latter result is, 
perhaps, the most surprising as MGMC is often viewed as necessarily slower 
than deterministic algorithms. However, that is clearly not the case, as we 
have shown. We also computed a second example to demonstrate how MTG 
sampling may be implemented in the more general setting, where A and L are 
not simultaneously diagonalizable using the FFT. By counting linear solves that 
dominate computational cost, we saw that the cost of calculating the posterior 
mean image was still less than the cost of the regularized inverse, by about a 
factor of 2 in the second computed example. 

Posterior statistics that we evaluated differed significantly from the values 
given by t he reg ularized solution. Figure [5] shows that the posterior distribu¬ 
tion for VxALx and data misfit \\Ax — y\\ is supported well away form the 
L-curve, so no value of regularizing parameter gives a meaningful summary of 
these statistics. The same conclusion holds for A = S/'y. This demonstrates 
that analyses of inverse problems that purport to be ‘Bayesian’ but then cal¬ 
culate solutions by optimization, and hence actually implement regularization, 
are evaluating summary statistics that may not even be in the support of the 
posterior distribution. 
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The decomposition used in the MTC sampler, in Algorithm [U is widely 
known in statistics as the marginal algorithm. An example is the ‘linchpin vari¬ 
able sampler’ presented in [I]; that algorithm was defined with a draw from 
the full conditional for x per iteration of an MCMC on the marginal posterior 
over hyperameters. The MTC sampler differs in that the MCMC over hyperpa¬ 
rameters is run until an effectively independent sample is generated, and only 
then is a sample generated from the full conditional for x. Agapiou et al. [2] 
also implemented the marginal algorithm, sampling from the marginal posterior 
over the hyperparameter <5, with 7 assumed known. They viewed the marginal 
algorithm as the “gold standard” finding, as we do here, that it is optimal for 
statistical efficiency, though also referred to it as “prohibitively expensive for 
large scale inverse problems”. We have been able to improve on that situation 
by implementing efficient computation within the MTC algorithm that enables 
inference over all hyperparameters and that outperforms other sampling algo¬ 
rithms in both statistical and computational efficiency, especially for large image 
size. Indeed, we think that the computational scheme in MTC may be close to 
optimal as the computational cost per independent sample is dominated, for 
large n, by the cost of a single linear solve. 

Agapiou et al. [5] restricted their consideration to Gaussian prior models with 
trace class covariance in the infinite-dimensional limit, and finite-dimensional 
discretizations. We followed [smi] by using a scaled graph Laplacian preci¬ 
sion in two dimensions. This does not generate a trace class covariance, which 
shows up as the log singularity in fundamental solutions of the Laplacian in 
two dimensions. Thus the effective prior covariance function that we use has a 
logarithmic singularity, which seems undesirable from a modeling perspective. 
The singularity also leads to mesh dependence in solutions; this problem is not 
so evident when using a structured mesh, as in our examples, since truncation 
of the singular covariance is roughly the same across the mesh and so the vari¬ 
ance is roughly stationary across the mesh. However, when using unstructured 
meshes the variance typically varies dramatically across the mesh, and so is un¬ 
suitable for applications. For these reasons, we also advocate restricting models 
to trace class covariances. 

Efficient computation in MTC relied on being able to evaluate the univari¬ 
ate functions / and g in (IT^ and 0 . When the functions / and g are well 
defined in the limit n —>■ 00 , pre-evaluation of these functions allows sampling 
independent 9\y ra. the oo-dimensional case, with a single linear solve required 
to generate an independent posterior sample x,9\y, using, for example, an iter¬ 
ative solver directly on function space [33]. We have presented computational 
schemes that are sufficient to perform efficient calculation in the examples pro¬ 
vided. However, we expect that these computational schemes can be signifi¬ 
cantly improved upon, particularly for large n and in other applications, and 
see this as a potentially fruitful topic for future research in computational UQ. 
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A Expansions of /(A) and g{X) in terms of A 

For selected tolerance e > 0 and s G N choose c/, Cg G (0,1) such that = 

e and = e. In practice we should tune c/, Cg, and s to gain efficiency. 

Let y ■= DFT(y), and let A and L be the vectors associated with using FFTs 
to evaluate A and L times a vector (they are the diagonal of the diagonalised 
matrices). Define Zi := Li/\Ai\'^ for all i. Then using Parseval’s theorem we 
obtain 



Since the determinant of a matrix is the product of its eigenvalues we also obtain 


n 


n 


5(A) = log I Aip + ^ log(l + xzg. 


Reorder the indices so that {Zi}"^^-^ is increasing and in 0{n) operations, using 
cumulative sums, precompute a s x n matrix 5", a (s + 1) x n matrix T, and 
s X n matrices U and V with entries 



Also precompute scalar a and n-vector b with entries 


n 


n 




for s = 1 ,..., n. 


The on-line calculation of /(A) and g(A) are described by the following lemmas. 
Lemma 4. For each evaluation of f{X) define mi and m 2 such that 


XZi < Cf for i < mi and XZi > Cj ^ for i > m 2 - 


Then for some E satisfying \E\ < e we have 



+ Y^i-ir+^X^Srm, + ^(-1)"A-"T,^, + E. 
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Proof. We first split /(A) into three terms, /(A) = Ti + T2 + Ts- The partial 
geometric series = {1 — z + z'^ — ... + (— ^ for \z\ < 1 
implies 


Ti = 


mi 

E' 


Ivt 


AZ,; 


Tl 1 “h XZi 


= E(-i) 

r—l 


r+1 


X^Sr 


where \Ei\ < since AZ^ < c/ for i < mi. 

Similarly, for Ta we use the partial geometric series ^ 


1 + 2 


El 


TTT = { 1 -Z ^ + Z 2- 


... + (-l)®z-®) + 


1 + 2 - 


for |2:| > 1 to obtain 


r = V 1 

^ n l + (AZ,)-i 


^(-1)’-A"T,™,+£;3 

r=0 


where [ifal < Sr=m2 since (AZ^) ^ < c/ for i > m2. Hence result, 

noting that E = Ei P E^, satisfies \E\ < J 27 =i \Vi^ 1 "^ = 

Lemma 5. for each evaluation of g(\), define mi and m2 such that 
XZi < Cg for i < mi and XZi > Cg^ for i > m2. 

Then for some E satisfying \E\ < e we have 


m 2 —1 

g{X) = a + 6^2 + (n - TO 2 -b 1) log(A) + ^ log(l + AZi) 

mi+l 

^ 1 — If ’—1 ^ ('_ 1')’'-1 

+ --- X^Urmi ”b - - -A ^Vrm2 "b E. 

r—l r=l 

Proof. We first split g{X) into four terms, g{X) = a + Ti + T2 + T3. 

Using the series expansion log(l + z) = J2t=i + (~1)* fo 

any \z\ < 1 we obtain 

mi s / -.Nr-l 

Tl = ^ log(l + AZ,) = ^ ^ -U^^A-- + fi 


where |fi| = 


,= 1 r=l 

mi / -I 'IS c>'Zi t‘‘ 


ETM-^rio 1 +* 

Similarly, for Ta we obtain 


dt 


< mi(AZi)®+^ < mic®+^ 


IL IL / 

73=^ log(AZi) + ^ log (1 + 


AZ, 


= (n — m2 ~b 1 ) log A + bm2 + - Vrm2^ “b f 3 


where jfal = 


J:tm2 (- 1 )^ ITbdt <in-m2 + l)c®+b 


Hence result, noting that f = fi + fa, so \E\ < \Ei \ + jfa] < nc®+^. □ 
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